An Arabic-Moroccan Darija Code-Switched Corpus
Authors
Abstract
In multilingual communities, speakers often switch between languages or dialects within the same context. This phenomenon is called code-switching. It can be observed, for example, in the Arab world, where Modern Standard Arabic and Dialectal Arabic coexist. Recently, the computational treatment of code-switching has received attention. Like other natural language processing tasks, it requires annotated linguistic resources. In our work, we turn to a particular under-resourced Arabic dialect, Moroccan Darija. While other dialects such as Egyptian Arabic have received their share of attention, very limited effort has been devoted to the development of basic linguistic resources that would support a computational treatment of Darija. Motivated by these considerations, we describe our effort in the development and annotation of a large-scale corpus collected from Moroccan social media sources, namely blogs and internet discussion forums. It was annotated at the token level by three native Darija speakers. Crowd-sourcing was not used. The final corpus contains 223k tokens. It is, to our knowledge, currently the largest resource of its kind.
Similar resources
Finding Romanized Arabic Dialect in Code-Mixed Tweets
Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects however also appear written in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a romanized Arabic dialect and distinguishes it from French...
Tweet Conversation Annotation Tool with a Focus on an Arabic Dialect, Moroccan Darija
This paper presents the DATOOL, a graphical tool for annotating conversations consisting of short messages (i.e., tweets), and the results we obtain in using it to annotate tweets for Darija, an historically unwritten Arabic dialect spoken by millions but not taught in schools and lacking standardization and linguistic resources. With the DATOOL, a native-Darija speaker annotated hundreds of mi...
An Algerian Arabic-French Code-Switched Corpus
Arabic is not just one language, but rather a collection of dialects in addition to Modern Standard Arabic (MSA). While MSA is used in formal situations, dialects are the language of every day life. Until recently, there was very little dialectal Arabic in written form. With the advent of social-media, however, the landscape has changed. We provide the first romanized code-switched Algerian Ara...
Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic
We present new language resources for Moroccan and Sanaani Yemeni Arabic. The resources include corpora for each dialect which have been morphologically annotated, and morphological analyzers for each dialect which are derived from these corpora. These are the first sets of resources for Moroccan and Yemeni Arabic. The resources will be made available to the public.
Addressing Code-Switching in French/Algerian Arabic Speech
This study focuses on code-switching (CS) in French/Algerian Arabic bilingual communities and investigates how speech technologies, such as automatic data partitioning, language identification and automatic speech recognition (ASR) can serve to analyze and classify this type of bilingual speech. A preliminary study carried out using a corpus of Maghrebian broadcast data revealed a relatively hi...
Publication date: 2016